{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# PyTorch and Overfitting\n", "\n", "This notebook provides an introduction to PyTorch and discusses the phenomenon of **overfitting**. So far we have covered the use of autograd to automatically differentiate Python functions. While autograd is a powerful tool, it can be quite slow. Since training ML models can be computationally intensive, using autograd for even medium-sized models may be too slow.\n", "\n", "Other libraries have been developed specifically for automatic differentiation for ML. Two of the most popular libraries are [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/). TensorFlow is a product of Google, and integrates nicely with Google's cloud computing platforms. However, it has a steeper learning curve and more verbose syntax. PyTorch is currently more commonly used ([here](https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=tensorflow,pytorch&hl=en) is a comparison using Google Trends). Other alternatives include [Keras](https://keras.io/), [Caffe](https://caffe.berkeleyvision.org/), and [MXNet](https://mxnet.apache.org/versions/1.9.1/). In this course, we will use PyTorch. \n", "\n", "PyTorch (and other deep learning libraries) provides a few benefits over autograd:\n", "\n", "1. It is highly optimized. Whereas autograd executes Python code, PyTorch relies on lower-level compiled implementations of the functions necessary to implement and train artificial neural network models.\n", "2. It was designed for training artificial neural networks.\n", "   - It includes default implementations of standard layers. For example, you can add convolutional layers, pooling layers, and fully connected layers from high-level Python code. 
You can also select between pre-defined common loss functions and activation functions.\n", "\n", "You can install PyTorch with:\n", "\n", "> pip install torch torchvision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the following imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# New to this topic:\n", "import torch\n", "import torch.nn as nn # For defining our neural network model\n", "import torch.optim as optim # For training the model using data\n", "from torch.utils.data import TensorDataset, DataLoader # For making mini-batches\n", "\n", "# From before:\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining a network architecture (parametric model) in PyTorch\n", "\n", "Neural network architectures (parametric models) are represented as subclasses that extend `nn.Module` (the base class for all neural network modules in PyTorch). `nn.Module` provides a range of built in functionalities, such as keeping track of trainable parameters, moving parameters and buffers to the GPU for GPU acceleration (more on this later!), saving and loading models, and more.\n", "\n", "To create a model, we need to define the constructor, `__init__` and a function for computing the output of the model given an input, called `forward` (since this is a forwards pass).\n", "\n", "- `__init__`: Inside of the constructor, we specify the structure of the parametric model. 
We do this by defining the different layers that will be used, their sizes, and the different activation functions that will be used.\n", "- `forward`: Inside of the forward function, we specify how the different layers are ordered and where the activation functions are applied.\n", "\n", "We do not need to specify any derivatives or anything about the backwards pass - this is all automatic!\n", "\n", "We will create a network with three hidden layers. This network is bigger than what is needed for the GPA prediction problem. We are using a relatively large network to better show the advantages of training on a GPU and to show something called \"overfitting\" later." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class FullyConnectedNetwork(nn.Module):\n", "    def __init__(self):\n", "        # First call the nn.Module constructor to initialize other parts of the model. Always do this first.\n", "        super(FullyConnectedNetwork, self).__init__()\n", "\n", "        # Define layers. The lines below create the layers (memory is allocated for the weights here).\n", "        self.fc1 = nn.Linear(9, 1024) # First hidden layer with 1024 neurons and 9 inputs.\n", "        self.fc2 = nn.Linear(1024, 512) # Second hidden layer with 512 neurons and 1024 inputs.\n", "        self.fc3 = nn.Linear(512, 128) # Third hidden layer with 128 neurons and 512 inputs.\n", "        self.fc4 = nn.Linear(128, 1) # Output layer with 1 neuron and 128 inputs.\n", "\n", "        # Define activation function. You could skip this step and use nn.ReLU in the forward pass,\n", "        # but that would be *slightly* less efficient. For small models it would likely be fine, but\n", "        # it's best practice to create the activation function object once in the constructor. 
Note that\n", "        # this object can be re-used any time a ReLU activation function is needed\n", "        # (we don't need many relu objects)\n", "        self.relu = nn.ReLU()\n", "\n", "    def forward(self, x):\n", "        # Pass data through the network\n", "        x = self.relu(self.fc1(x))\n", "        x = self.relu(self.fc2(x))\n", "        x = self.relu(self.fc3(x))\n", "        x = self.fc4(x) # No activation after the output layer\n", "        return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now create an instance of this model:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FullyConnectedNetwork(\n", "  (fc1): Linear(in_features=9, out_features=1024, bias=True)\n", "  (fc2): Linear(in_features=1024, out_features=512, bias=True)\n", "  (fc3): Linear(in_features=512, out_features=128, bias=True)\n", "  (fc4): Linear(in_features=128, out_features=1, bias=True)\n", "  (relu): ReLU()\n", ")\n" ] } ], "source": [ "# Create an instance of the network\n", "net = FullyConnectedNetwork()\n", "\n", "# The network structure is printed as a sanity check\n", "print(net)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `bias=True` term indicates that each perceptron includes an extra feature that is always equal to 1 (and hence one extra weight beyond the number of outputs from the previous layer). This is what we discussed previously when we talked about appending a 1 to the columns of a data set to implement the \"y-intercept\" in linear regression. For perceptrons and neural networks, this extra weight is called the **bias**.\n", "\n", "Next, let's load the GPA data, split it into training and testing, and standardize it." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n", "#df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n", "\n", "# We already loaded X and y, but do it again as a reminder\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "\n", "# Split the data into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)\n", "\n", "# Standardize the features\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train) # This computes each feature's mean and standard deviation from the training data (without looking at the testing data)\n", "X_test = scaler.transform(X_test) # This reuses the mean and standard deviation computed from the training data! (transform, not fit_transform)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTorch has its own objects for storing data, called PyTorch tensors. These are simply multidimensional arrays. Let's convert our data to these tensor objects. Note that the `tensor` constructor is not compatible with `pandas.Series` objects, so we call `y_train.values` and `y_test.values` to convert these to `numpy.ndarray` objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert data to PyTorch tensors\n", "X_train_tensor = torch.tensor(X_train, dtype=torch.float32)\n", "y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1,1)\n", "X_test_tensor = torch.tensor(X_test, dtype=torch.float32)\n", "y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1,1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, the `.view(-1,1)` on the labels reshapes the tensor. 
The first argument says to automatically calculate the size of the first dimension, and the second argument says that the second dimension should be 1. Without `.view(-1,1)`, the `y_train_tensor` and `y_test_tensor` would be flat 1-dimensional tensors. The built-in MSE loss function that we use later expects the target to have the same shape as the model's output, which here is a 2-dimensional tensor with one column. The call to `.view(-1,1)` converts the 1-dimensional tensor into a 2-dimensional tensor (that happens to only have one column). Without this line, the training code below should produce a warning pointing out that a 2-dimensional \"target\" (label) was expected, and that incorrect outputs could result.\n", "\n", "Note that PyTorch does not have its own functions for reading data from CSV files, nor for performing train-test splits. It is therefore common to use Pandas, Scikit-Learn, and PyTorch together like this.\n", "\n", "For now, we will keep the data as tensor objects. Later, we will use PyTorch's own dataset representation, `TensorDataset`, which is particularly useful when training on a GPU.\n", "\n", "Next, let's define the loss function that we would like to minimize (the sample MSE). PyTorch has common loss functions built in, so we do not need to re-implement the loss function ourselves!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "loss_function = nn.MSELoss()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's choose an algorithm for solving the optimization problem. There are many built-in optimizers. For example, we could perform gradient descent with:\n", "\n", "```\n", "optimizer = optim.SGD(net.parameters(), lr=0.01)\n", "```\n", "\n", "Here, `lr` is the learning rate. We could also include a momentum parameter `momentum=0.9`. Other common optimizers include:\n", "\n", "1. RMSprop: This is particularly useful for recurrent neural networks.\n", "   ```\n", "   optimizer = torch.optim.RMSprop(net.parameters(), lr=0.01)\n", "   ```\n", "2. 
Adagrad: This is particularly useful for sparse data (data where most values are zero).\n", "   ```\n", "   optimizer = torch.optim.Adagrad(net.parameters(), lr=0.01)\n", "   ```\n", "3. Adadelta: This does not require manually choosing a learning rate.\n", "   ```\n", "   optimizer = torch.optim.Adadelta(net.parameters())\n", "   ```\n", "4. AdamW: A variant of Adam that includes weight decay.\n", "   ```\n", "   optimizer = torch.optim.AdamW(net.parameters(), lr=0.01)\n", "   ```\n", "5. SGD with momentum:\n", "   ```\n", "   optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)\n", "   ```\n", "\n", "We will use the Adam optimizer, which is currently the most popular optimizer for deep learning." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "optimizer = optim.Adam(net.parameters())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can run the training loop. For each epoch:\n", "\n", "1. We call `net(X_train_tensor)` to run the forward pass for each row in `X_train_tensor`.\n", "2. We compute the resulting loss. This is a forward pass of the loss function, and is necessary!\n", "3. We compute a backward pass starting from the loss function with `loss.backward()`. This computes the gradient with respect to each model parameter.\n", "   - **Note**: Each model parameter has a `.grad` attribute storing the gradient of the loss w.r.t. that parameter. When `.backward()` is called, the gradients for each parameter are accumulated in this `.grad` attribute. That is, they are added to whatever is currently stored in `.grad`! This can be useful in more advanced architectures where gradients from different sources are combined. \n", "   - This also means that we need to clear the `.grad` attribute at the start of each epoch, so that we do not sum up the gradients across all epochs. This is achieved with `optimizer.zero_grad()`.\n", "4. 
We update the weights using the optimizer via `optimizer.step()`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch [0/100], Loss: 8.3453\n", "Epoch [10/100], Loss: 1.4491\n", "Epoch [20/100], Loss: 0.8890\n", "Epoch [30/100], Loss: 0.7326\n", "Epoch [40/100], Loss: 0.7181\n", "Epoch [50/100], Loss: 0.6600\n", "Epoch [60/100], Loss: 0.6356\n", "Epoch [70/100], Loss: 0.6110\n", "Epoch [80/100], Loss: 0.5917\n", "Epoch [90/100], Loss: 0.5783\n" ] } ], "source": [ "epochs = 100 # The number of epochs to run\n", "for epoch in range(epochs):\n", "    # Zero the gradients\n", "    optimizer.zero_grad()\n", "\n", "    # Forward pass\n", "    y_pred = net(X_train_tensor)\n", "\n", "    # Compute loss\n", "    loss = loss_function(y_pred, y_train_tensor)\n", "\n", "    # Backward pass and optimize\n", "    loss.backward()\n", "    optimizer.step()\n", "\n", "    # Print statistics\n", "    if epoch % 10 == 0:\n", "        print(f'Epoch [{epoch}/{epochs}], Loss: {loss.item():.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On my desktop, this took 32.7 seconds. We will revisit this later.\n", "\n", "We can then evaluate the learned model on the testing data. When we do this, we do not need to store the gradient information used for training, so we wrap the code in:\n", "```\n", "with torch.no_grad():\n", "    ...\n", "```\n", "This tells PyTorch that when running the model, it doesn't need to store the intermediate information that a backward pass would require. Rather, we will only be using the output of the parametric model." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.5853\n" ] } ], "source": [ "# Evaluate the model with test data\n", "with torch.no_grad():\n", "    y_pred_test = net(X_test_tensor)\n", "    test_loss = loss_function(y_pred_test, y_test_tensor)\n", "    print(f'Test Loss: {test_loss.item():.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice:\n", "1. This may be slower than training the linear parametric model. That isn't because PyTorch is slower than autograd, but rather because we're training a significantly more complex model.\n", "2. This was running on the CPU. PyTorch automatically uses multithreading, so it used all of the available cores during training.\n", "\n", "### Training ML Models on the GPU\n", "\n", "For such a small model, training on the CPU is sufficient. For bigger models and data sets, you may want to train on the GPU. In class we will discuss the benefits of training ML models on the GPU. PyTorch makes training on the GPU relatively simple. We simply need to:\n", "\n", "1. Install CUDA. You can download the \"CUDA Toolkit\" from NVIDIA [here](https://developer.nvidia.com/cuda-toolkit). Before doing so, look at the versions of CUDA that PyTorch supports using the link in the next instruction. Right now, the latest version is CUDA 12.1, which you can download [here](https://developer.nvidia.com/cuda-12-1-0-download-archive).\n", "2. Confirm that your PyTorch installation is compatible with your version of CUDA. You can get the appropriate PyTorch installation commands for your version of CUDA [here](https://pytorch.org/get-started/locally/). For example, if you want PyTorch 2.2.1 on Windows, installed using pip, and using CUDA 12.1, the installation command is:\n", "   > pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\n", "3. 
In the code:\n", "   - Check if CUDA (GPU support) is available.\n", "     > device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "   - Move the network to the GPU.\n", "     > net.to(device)\n", "   - Move the training data to the GPU.\n", "     > X_train_tensor = X_train_tensor.to(device)\n", "     \n", "     > y_train_tensor = y_train_tensor.to(device)\n", "   - When we are done training the model, we can then move it back to the CPU:\n", "     > net.to('cpu')\n", "\n", "We can do all of this with the following lines:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "device(type='cuda')" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch [0/100], Loss: 8.6576\n", "Epoch [10/100], Loss: 1.3355\n", "Epoch [20/100], Loss: 1.0290\n", "Epoch [30/100], Loss: 0.7493\n", "Epoch [40/100], Loss: 0.7279\n", "Epoch [50/100], Loss: 0.6782\n", "Epoch [60/100], Loss: 0.6503\n", "Epoch [70/100], Loss: 0.6216\n", "Epoch [80/100], Loss: 0.6011\n", "Epoch [90/100], Loss: 0.5857\n" ] } ], "source": [ "net = FullyConnectedNetwork() # Create a new network to train from scratch\n", "optimizer = optim.Adam(net.parameters()) # Create the optimizer for this network\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # Check if CUDA (GPU) available\n", "display(device) # Confirm that the GPU is being used\n", "\n", "net.to(device) # Move the network to GPU if available\n", "X_train_tensor = X_train_tensor.to(device) # Also move the tensors to the chosen device\n", "y_train_tensor = y_train_tensor.to(device)\n", "\n", "epochs = 100 # Number of epochs\n", "for epoch in range(epochs):\n", "    optimizer.zero_grad() # Zero the gradients\n", "    y_pred = net(X_train_tensor) # Forward pass\n", "    loss = loss_function(y_pred, y_train_tensor) # Compute the loss (needed for the backward pass and for printing)\n", "    loss.backward() # Backwards pass\n", "    optimizer.step() # Update the 
weights using the optimizer\n", "    if epoch % 10 == 0: # Print statistics\n", "        print(f'Epoch [{epoch}/{epochs}], Loss: {loss.item():.4f}')\n", "\n", "net.to('cpu'); # Move the model back to the CPU" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, that was fast! It took 2.6 seconds (GPU/CUDA) rather than 32.7 seconds (CPU)! For bigger models and data sets, this improvement can be even more extreme. In class we will discuss why it is often faster to train large ML models on the GPU. For now, let's confirm that we get a similar MSE on the testing set." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.5872\n" ] } ], "source": [ "# Evaluate the model with test data (optional)\n", "with torch.no_grad():\n", "    y_pred_test = net(X_test_tensor)\n", "    test_loss = loss_function(y_pred_test, y_test_tensor)\n", "    print(f'Test Loss: {test_loss.item():.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using mini-batches\n", "\n", "As we saw before, mini-batches can speed up optimization, resulting in better model parameters from fewer epochs. PyTorch provides a `DataLoader` class that can be used to automatically divide data into mini-batches. \n", "\n", "In order to use a DataLoader, we need to convert the data set into a TensorDataset. We do this with:\n", "```\n", "train_dataset = TensorDataset(X_train_tensor, y_train_tensor)\n", "```\n", "\n", "We can then create the DataLoader object with:\n", "```\n", "train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)\n", "```\n", "\n", "This DataLoader object is the standard method for passing data to a model that is being trained on a GPU. It simplifies a few things:\n", "1. It provides methods for performing batching (using mini-batches).\n", "2. It provides methods for shuffling the data.\n", "3. 
It can use multiple worker processes to prepare and send data to the model. This is useful for very large data sets.\n", "   - This also minimizes the time that the GPU spends waiting for data to be provided.\n", "\n", "For example, we can iterate over batches within an epoch with:\n", "```\n", "for X_batch, y_batch in train_loader:\n", "```\n", "We can then move these batches to the GPU with:\n", "```\n", "X_batch, y_batch = X_batch.to(device), y_batch.to(device)\n", "```\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch [0/100], Average Loss: 0.6923\n", "Epoch [10/100], Average Loss: 0.5668\n", "Epoch [20/100], Average Loss: 0.5559\n", "Epoch [30/100], Average Loss: 0.5470\n", "Epoch [40/100], Average Loss: 0.5356\n", "Epoch [50/100], Average Loss: 0.5185\n", "Epoch [60/100], Average Loss: 0.4981\n", "Epoch [70/100], Average Loss: 0.4797\n", "Epoch [80/100], Average Loss: 0.4556\n", "Epoch [90/100], Average Loss: 0.4359\n" ] } ], "source": [ "net = FullyConnectedNetwork()\n", "optimizer = optim.Adam(net.parameters())\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "net.to(device)\n", "\n", "# Create a TensorDataset and DataLoader\n", "train_dataset = TensorDataset(X_train_tensor, y_train_tensor)\n", "train_loader = DataLoader(dataset=train_dataset, batch_size=100, shuffle=True)\n", "\n", "epochs = 100\n", "for epoch in range(epochs):\n", "    total_loss = 0.0 # To sum the loss over all batches\n", "    num_batches = 0 # A lazy way to get the number of batches: count them\n", "    for X_batch, y_batch in train_loader: # Iterate over mini-batches\n", "        X_batch, y_batch = X_batch.to(device), y_batch.to(device) # Move batches to the chosen device\n", "        optimizer.zero_grad()\n", "        y_pred = net(X_batch)\n", "        loss = loss_function(y_pred, y_batch)\n", "        total_loss += loss.item()\n", "        num_batches += 1\n", "        loss.backward()\n", "        optimizer.step()\n", "\n", "    # Calculate the 
average loss over all mini-batches in this epoch\n", "    average_loss = total_loss / num_batches\n", "\n", "    if epoch % 10 == 0: # Print statistics\n", "        print(f'Epoch [{epoch}/{epochs}], Average Loss: {average_loss:.4f}')\n", "net.to('cpu'); # Move the model back to CPU if needed\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While 100 epochs took longer, far more gradient updates were performed. Notice that the training loss reached lower values in far fewer epochs. So, the \"time to complete 100 epochs\" is not a particularly fair metric.\n", "\n", "Below we compute the loss on the held-out testing set." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.7318\n" ] } ], "source": [ "# Evaluate the model with test data (optional)\n", "with torch.no_grad():\n", "    y_pred_test = net(X_test_tensor)\n", "    test_loss = loss_function(y_pred_test, y_test_tensor)\n", "    print(f'Test Loss: {test_loss.item():.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the training loss was significantly lower than the testing loss. We observed this previously with the nearest neighbor methods. Let's investigate this further by plotting the training and testing loss after each epoch." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch [0/100], Training Loss: 0.6952, Testing Loss: 0.5909\n", "Epoch [10/100], Training Loss: 0.5657, Testing Loss: 0.5853\n", "Epoch [20/100], Training Loss: 0.5565, Testing Loss: 0.6122\n", "Epoch [30/100], Training Loss: 0.5487, Testing Loss: 0.5915\n", "Epoch [40/100], Training Loss: 0.5362, Testing Loss: 0.6002\n", "Epoch [50/100], Training Loss: 0.5193, Testing Loss: 0.6233\n", "Epoch [60/100], Training Loss: 0.4986, Testing Loss: 0.6340\n", "Epoch [70/100], Training Loss: 0.4755, Testing Loss: 0.6614\n", "Epoch [80/100], Training Loss: 0.4476, Testing Loss: 0.6838\n", "Epoch [90/100], Training Loss: 0.4200, Testing Loss: 0.7116\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a new network to train (the old one already had trained weights, and we want to start from scratch).\n", "net = FullyConnectedNetwork()\n", "\n", "# Create the optimizer for this new network\n", "optimizer = optim.Adam(net.parameters())\n", "\n", "# Check if CUDA (GPU support) is available\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "# Move the network to the GPU if available\n", "net.to(device)\n", "\n", "# Send the testing data to the GPU\n", "X_test_tensor = X_test_tensor.to(device)\n", "y_test_tensor = y_test_tensor.to(device)\n", "\n", "# Convert the training data to a TensorDataset\n", "train_dataset = TensorDataset(X_train_tensor, y_train_tensor)\n", "\n", "# Create a DataLoader to handle mini-batch loading\n", "batch_size = 100\n", "train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)\n", "\n", "epochs = 100 # Number of epochs\n", "training_losses = []\n", "testing_losses = []\n", "\n", "for epoch in range(epochs):\n", "    net.train() # Set the network into training mode\n", "\n", "    # We will track the total loss across all mini-batches in this epoch\n", "    total_loss = 0.0\n", "    num_batches = 0\n", "\n", "    for X_batch, y_batch in train_loader:\n", "        # Move batches to the same device as model\n", "        X_batch, y_batch = X_batch.to(device), y_batch.to(device)\n", "\n", "        # Zero the gradients\n", "        optimizer.zero_grad()\n", "\n", "        # Forward pass\n", "        y_pred = net(X_batch)\n", "\n", "        # Compute loss\n", "        loss = loss_function(y_pred, y_batch)\n", "        total_loss += loss.item()\n", "        num_batches += 1\n", "\n", "        # Backward pass and update\n", "        loss.backward()\n", "        optimizer.step()\n", "\n", "    # Calculate the average loss over all mini-batches in this epoch\n", "    average_train_loss = total_loss / num_batches\n", "    training_losses.append(average_train_loss)\n", "\n", "    # Evaluate the model on the test 
data\n", "    net.eval() # Set the network to evaluation mode\n", "    with torch.no_grad():\n", "        # Forward pass on the entire test data\n", "        y_pred_test = net(X_test_tensor)\n", "\n", "        # Compute the loss on the entire test data\n", "        test_loss = loss_function(y_pred_test, y_test_tensor).item()\n", "        testing_losses.append(test_loss)\n", "\n", "    if epoch % 10 == 0:\n", "        print(f'Epoch [{epoch}/{epochs}], Training Loss: {average_train_loss:.4f}, Testing Loss: {test_loss:.4f}')\n", "\n", "# Plotting the training and testing losses\n", "plt.plot(training_losses, label='Training Loss')\n", "plt.plot(testing_losses, label='Testing Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.title('Training and Testing Loss Over Epochs')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting\n", "\n", "Notice that the error on the training set keeps decreasing, but the error on the testing set increases! This is a phenomenon called **overfitting**, which will be discussed in lecture. Below are some relevant plots for this discussion.\n", "\n", "First, we plot 10 points from the line $y=x$ with Gaussian noise added:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "x = np.linspace(0, 10, 10) # Generate 10 evenly spaced x values in [0, 10]\n", "y = x + np.random.normal(0, 1, 10) # Add noise\n", "\n", "# Plot\n", "plt.scatter(x, y, marker='.')\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.title('Plot of y = x + Gaussian noise')\n", "plt.grid(True)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's plot the least squares fit using the 10th degree polynomial basis." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\pthomas\\AppData\\Local\\Temp\\ipykernel_6512\\3960549941.py:2: RankWarning: Polyfit may be poorly conditioned\n", "  coefficients = np.polyfit(x, y, 10)\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Perform a least squares fit with a 10th degree polynomial basis\n", "coefficients = np.polyfit(x, y, 10)\n", "polynomial_fit = np.polyval(coefficients, np.linspace(0, 10, 100))\n", "\n", "# Plotting the points as dots and the polynomial fit curve\n", "plt.scatter(x, y, marker='.')\n", "plt.plot(np.linspace(0, 10, 100), polynomial_fit, color='red') # Polynomial fit curve\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.title('10th Degree Polynomial Fit to Points with Gaussian Noise')\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This polynomial achieves a training error of zero - it goes precisely through each point! However, the test error will not be zero (remember, these are points from $y=x$ with Gaussian noise added). Here is the least squares linear fit:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Perform a linear fit (least squares)\n", "linear_coefficients = np.polyfit(x, y, 1)\n", "linear_fit = np.polyval(linear_coefficients, x)\n", "\n", "# Plotting the points as dots, the linear fit, and the polynomial fit curve\n", "plt.scatter(x, y, marker='.')\n", "plt.plot(x, linear_fit, color='green', label='Linear Fit') # Linear fit line\n", "plt.plot(np.linspace(0, 10, 100), polynomial_fit, color='red', label='10th Degree Polynomial Fit') # Polynomial fit curve\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.title('Linear and 10th Degree Polynomial Fit to Points with Gaussian Noise')\n", "plt.legend()\n", "plt.grid(True)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }